Abstract

This paper presents a comprehensive 
empirical comparison between two ap- 
proaches for developing a base noun 
phrase chunker: human rule writing and 
active learning using interactive real- 
time human annotation. Several novel 
variations on active learning are inves- 
tigated, and underlying cost models for 
cross-modal machine learning compari- 
son are presented and explored. Results 
show that it is more efficient and more 
successful by several measures to train a 
system using active learning annotation 
rather than hand-crafted rule writing at 
a comparable level of human labor in- 
vestment. 

1 Introduction 

One of the primary problems that NLP re- 
searchers who work in new languages or new do- 
mains encounter is a lack of available annotated 
data. Collection of data is neither easy nor cheap. 
The construction of the Penn Treebank signifi- 
cantly improved performance for English systems 
dealing in the "traditional" NLP domains (e.g.
parsing, part-of-speech tagging, etc.). However,
for a new language, a similar investment of ef- 
fort in time and money is most likely prohibitive, 
if not impossible. 

Faced with the costs associated with data ac- 
quisition, rationalists may argue that it would be 
more cost effective to construct systems of hand- 
coded rule lists that capture the linguistic charac- 
teristics of the task at hand, rather than spending 
comparable effort annotating data and expecting 
the same knowledge to be acquired indirectly by 
a machine learning system. The question we are
trying to address, then, is: for a given cost
assumption, which approach would be the most effective?

Although learning curves showing performance
relative to amount of training data are common
in the machine learning literature, these are
inadequate for comparing systems with different sources
of training data or supervision. This is especially
true when a human rule-based approach and
empirical learning are evaluated relative to effort
invested. Such a multi-factor cost analysis is long
overdue.

This paper will conclude with a comprehensive 
cost model exposition and analysis, and an em- 
pirical study contrasting human rule-writing ver- 
sus annotation-based learning approaches that are 
sensitive to these cost models. 

2 Base Noun Phrase Chunking 



The domain in which our experiments are per- 
formed is base noun phrase chunking. A sig- 
nificant amount of work has been done in this 
domain and many different methods have been 
applied: Church's PARTS (1988) program used
a Markov model; Bourigault (1992) used heuristics
along with a grammar; Voutilainen's NPTool
(1993) used a lexicon combined with a constraint
grammar; Justeson and Katz (1995) used repeated
phrases; Veenstra (1998), Argamon, Dagan &
Krymolowski (1998), Daelemans, van den Bosch
& Zavrel (1999) and Tjong Kim Sang & Veenstra
(1999) used memory-based systems; Ramshaw &
Marcus (1999) and Cardie & Pierce (1998) used
rule-based systems; Munoz et al. (1999) used a
Winnow-based system; and the XTAG Research
Group (1998) used a tree-adjoining grammar.

Of all the systems, Ramshaw & Marcus' trans- 
formation rule-based system had the best pub- 
lished performance (f-measure 92.0) for several 
years, and is regarded as the de facto standard 
for the domain. Although several systems have 
recently achieved slightly higher published results 
(Munoz et al.: 92.8, Tjong Kim Sang & Veenstra:
92.37, XTAG Research Group: 92.4), their
algorithms are significantly more costly, or not
feasible, to implement in an active learning
framework. To facilitate contrastive studies, we have
evaluated our active learning and cost model
comparisons using Ramshaw & Marcus' system as the
reference algorithm in these experiments.

3 Active Learning from 
Annotation 

Supervised statistical machine learning systems 
have traditionally required large amounts of anno- 
tated data from which to extract linguistic prop- 
erties of the task at hand. However, not all data 
is created equal. A random distribution of an- 
notated data contains much redundant informa- 
tion. By intelligently choosing the training exam- 
ples which get passed to the learner, it is possible 
to provide the necessary amount of information 
with less data. 

Active learning attempts to perform this intel- 
ligent sampling of data to reduce annotation costs 
without damaging performance. In general, these 
methods calculate the usefulness of an example by 
first having the learner classify it, and then seeing 
how uncertain that classification was. The idea is 
that the more uncertain the example, the less well 
modeled this situation is, and therefore, the more 
useful it would be to have this example annotated. 

3.1 Prior Work in Active Learning 



Seung, Opper and Sompolinsky (1992) and
Freund et al. (1997) proposed a theoretical
query-by-committee approach. Such an approach uses
multiple models (or a committee) to evaluate the 
data, and candidates for annotation (or queries) 
are drawn from the pool of examples in which 
the models disagree. Furthermore, Freund et al. 
prove that, under some situations, the generaliza- 
tion error decreases exponentially with the num- 
ber of queries. 

On the experimental side, active learning has
been applied to several different problems. Lewis
& Gale (1994), Lewis & Catlett (1994) and Liere
& Tadepalli (1997) all applied it to text
categorization; Engelson & Dagan (1996) applied it to
part-of-speech tagging.

Each approach has its own way of determin- 
ing uncertainty in examples. Lewis & Gale used 
a probabilistic classifier and picked the examples 
e whose class-conditional a posteriori probability
P(C|e) is closest to 0.5 (for a 2-class problem).
Engelson & Dagan implemented a committee of 
learners, and used vote entropy to pick examples 
which had the highest disagreement among the 
learners. In addition, Engelson & Dagan also in- 
vestigate several different selection techniques in 
depth. 
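
As a rough illustration of the Lewis & Gale style of selection described above, the following Python sketch ranks examples by how close their posterior probability is to 0.5. The function name `most_uncertain` and the sample probabilities are hypothetical and are not taken from any of the cited systems.

```python
def most_uncertain(probabilities, n):
    """Return the indices of the n examples whose P(C|e) is closest to 0.5.

    `probabilities` holds one posterior per example for a 2-class problem.
    This is an illustrative sketch, not the cited authors' implementation.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:n]


# Made-up posteriors for five unlabelled examples.
posteriors = [0.95, 0.52, 0.10, 0.45, 0.70]
print(most_uncertain(posteriors, 2))
```

A learner without probabilistic output cannot be ranked this way, which is why a committee-based measure is needed for transformation-based learning.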



3.2 New Applications and Algorithmic 
Extensions in Active Learning 

To our knowledge, this paper constitutes the
first work to apply active learning to base noun
phrase chunking, or to apply active learning to
a transformation-learning paradigm (Brill, 1995)
for any application. Since a transformation-based
learner does not give a probabilistic output, we are
not able to use Lewis & Gale's method for
determining uncertainty. Our experimental framework
thus uses the query by committee paradigm with
batch selection:

1. Given a corpus C, arbitrarily pick t sentences 
for annotation. 

2. Have these t sentences hand-annotated, 
delete them from C and put them into a train- 
ing set, T. 

3. Divide T into m non-identical, but not nec- 
essarily non-overlapping, subsets. 

4. Use each subset as the training set for a 
model. 

5. Evaluate each model on the remaining
sentences in C.

6. Using a measure of disagreement D, pick the 
x sentences in C with the highest D for an- 
notation. 

7. Delete the x sentences from C, have them 
annotated, and add them to T. 

8. Repeat from Step 3.
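
The batch-selection loop above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `annotate`, `train` and `disagreement` are hypothetical callables standing in for the human annotator, the learner and the measure D; the seed set is simply the first t sentences; and the committee subsets are built by leaving out one of m partitions, one of the options discussed later in this section.

```python
def query_by_committee(pool, annotate, train, disagreement,
                       t=100, x=50, m=3, iterations=10):
    """Batch query-by-committee selection (illustrative sketch).

    annotate(sentence) -> labelled sentence      (hypothetical)
    train(examples) -> model, a callable         (hypothetical)
    disagreement(list of model outputs) -> float (hypothetical)
    """
    pool = list(pool)
    # Steps 1-2: annotate the first t sentences and move them into T.
    training = [(s, annotate(s)) for s in pool[:t]]
    pool = pool[t:]
    for _ in range(iterations):
        if not pool:
            break
        # Step 3: m overlapping subsets, each leaving out one partition.
        folds = [training[i::m] for i in range(m)]
        subsets = [[ex for j, fold in enumerate(folds) if j != skip
                    for ex in fold]
                   for skip in range(m)]
        # Step 4: train one model per subset.
        committee = [train(subset) for subset in subsets]
        # Steps 5-6: score the remaining sentences, take the x highest.
        scores = [disagreement([model(s) for model in committee])
                  for s in pool]
        order = sorted(range(len(pool)), key=lambda i: scores[i],
                       reverse=True)
        picked = order[:x]
        # Step 7: annotate the batch and move it from C to T.
        training += [(pool[i], annotate(pool[i])) for i in picked]
        chosen = set(picked)
        pool = [s for i, s in enumerate(pool) if i not in chosen]
    return training
```

Passing the learner and the disagreement measure in as arguments keeps the loop independent of any particular chunker or committee measure.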

In our experiments, the initial corpus C that we
used consisted of sections 15-18 of the Wall Street
Journal Treebank (Marcus et al., 1993), which is
also the training set used by Ramshaw & Mar- 
cus (1999). The initial t sentences were the first 
100 sentences of the training corpus, and x = 50 
sentences were picked at each iteration. Sets of 
50 sentences were selected because it takes ap- 
proximately 15-30 minutes for humans to anno- 
tate them, a reasonable amount of work and time 
for the annotator to spend before taking a break 
while the machine selects the next set. The pa- 
rameter m, which denotes the number of models 
to train, was set at 3, which could be expected 
to give us reasonable labelling variation over the 
samples, but also would not cause the processing 
phase to take a long time. 

Figure 1: Performance vs. training set size: active learning and sequential annotation on Treebank
data

To divide the corpus into the different subsets
in Step 3, we tried using two approaches: bagging
and n-fold partitioning. In bagging, we randomly
pick (with replacement) a fraction of the total number of
sentences in C to assign to each subset. With
n-fold partitioning, we partitioned the data into
3 discrete partitions, and each model was then
trained on 2 of the 3 partitions. We found no
significant difference between the two methods.
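
The two subset-construction strategies might be sketched in Python as follows. The function names are hypothetical, and the `fraction` argument to the bagging variant is left as a free parameter rather than a value taken from the experiments.

```python
import random


def bagging_subsets(training, m, fraction, rng):
    """Draw m subsets by sampling with replacement.

    `fraction` (an assumed free parameter) sets each subset's size
    relative to the training set.
    """
    size = max(1, int(len(training) * fraction))
    return [[rng.choice(training) for _ in range(size)] for _ in range(m)]


def fold_subsets(training, n=3):
    """Split into n disjoint partitions; each subset omits one of them."""
    folds = [training[i::n] for i in range(n)]
    return [
        [item for j, fold in enumerate(folds) if j != held_out
         for item in fold]
        for held_out in range(n)
    ]


sentences = list(range(9))          # stand-ins for annotated sentences
print(fold_subsets(sentences, 3))   # three subsets of six items each
print(bagging_subsets(sentences, 3, 0.5, random.Random(0)))
```

With n = 3, every pair of fold-based subsets shares exactly one partition, so the subsets overlap without being identical, as Step 3 requires.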

3.2.1 Models of Disagreement for the 
Selection of New Data 

The standard method for measuring disagreement
for sample selection in active learning algorithms
that use query by committee is Engelson
& Dagan's vote entropy measure. Given a
tagged example e [1], the disagreement D for e is [2]:

$$D(e) = -\frac{1}{\log k} \sum_{c} \frac{V(c, e)}{k} \log \frac{V(c, e)}{k}$$

where

k = Number of models in the committee.
V(c, e) = Number of models assigning c to e

However, here we propose a novel disagreement
measure that is both more applicable and achieves
slightly improved performance. We base our measure
on the f-measure metric, which is defined as:

$$F_\beta = \frac{(\beta^2 + 1) \times Precision \times Recall}{\beta^2 \times Precision + Recall}$$

$$Precision = \frac{\text{\# of correct proposed labellings}}{\text{\# of proposed labellings}}$$

$$Recall = \frac{\text{\# of correct proposed labellings}}{\text{\# of correct labellings}}$$

The variable β allows precision and recall to be
weighed differently. In all our experiments, β is
set to 1, giving an equal weight to both precision
and recall.

1 Vote entropy calculates the disagreement on a per
tagged unit basis. In domains such as part-of-speech
tagging or base noun phrase chunking, each tagged
unit is a word. We prefer to select entire sentences
as candidates for annotation. In situations like these,
the disagreement over the entire sentence is simply the
mean disagreement over the words in the sentence.

2 Dividing by log k normalizes for the number of
models.

For our disagreement measure D, we use the
F-complement, which is calculated as:

$$D(e) = \frac{1}{2} \sum_{M_i, M_j \in \mathcal{K}} \left(1 - F_1(M_i(e), M_j(e))\right)$$

where K is the committee of models, M_i, M_j are
individual models in K, and F_1(M_i(e), M_j(e)) is
F_1 of M_i's labelling of e relative to M_j's evaluation
of e. [3]
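
To make the two disagreement measures concrete, here is a minimal Python sketch. It assumes the F-complement is the sum of 1 - F_1 over pairs of committee members, which matches the definitions above only to the extent that F_1 is symmetric. Chunks are represented as sets of (start, end) spans, and two empty chunkings are treated as fully agreeing. All function names and this representation are illustrative choices.

```python
import math
from collections import Counter
from itertools import combinations


def vote_entropy(votes):
    """Vote entropy for one tagged unit; needs at least two models."""
    k = len(votes)
    total = 0.0
    for count in Counter(votes).values():
        p = count / k
        total += p * math.log(p)
    return -total / math.log(k)  # dividing by log k normalizes to [0, 1]


def sentence_vote_entropy(labellings):
    """Mean per-word vote entropy; labellings[i] is model i's tag list."""
    per_word = [vote_entropy(votes) for votes in zip(*labellings)]
    return sum(per_word) / len(per_word)


def f1(proposed, reference):
    """F1 of one set of chunk spans measured against another."""
    correct = len(proposed & reference)
    if correct == 0:
        # Assumption: two empty chunkings count as perfect agreement.
        return 1.0 if not proposed and not reference else 0.0
    precision = correct / len(proposed)
    recall = correct / len(reference)
    return 2 * precision * recall / (precision + recall)


def f_complement(chunkings):
    """Sum of (1 - F1) over unordered pairs of committee members."""
    return sum(1.0 - f1(a, b) for a, b in combinations(chunkings, 2))


# Three hypothetical models chunking the same sentence.
model_a = {(0, 2), (3, 5)}
model_b = {(0, 2), (3, 5)}
model_c = {(0, 2), (4, 5)}
print(f_complement([model_a, model_b, model_c]))
print(sentence_vote_entropy([["B", "I", "O"], ["B", "I", "O"], ["B", "O", "O"]]))
```

Because the F-complement compares whole labellings, it only needs each model's set of proposed chunks, whereas vote entropy needs the models' outputs to align unit by unit.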

Figure 1 shows the test set performance against
the number of words in the training corpus for
sequential annotation and active learning, using
vote entropy and f-complement as the measures
of disagreement. As can be seen from the graphs,
f-complement gives a small empirical boost in
performance. More importantly, f-complement can
be used in applications where implementation of
vote entropy is difficult, for example, parsing. The
comparison between systems trained on annotated
sentences selected by active learning and
annotated sentences selected sequentially shows that
active learning reduces the amount of data needed
to reach a given level of performance by
approximately a factor of two.

3 β = 1 makes the F-measure symmetrical.

3.3 Active Learning with Real Time 
Human Supervision 

Most of the published work on active learning consists
of simulations of an idealized situation. One has a
large annotated corpus, and the new tags for the 
"newly annotated" sentences are simply drawn 
from what was observed in the annotated corpus, 
as if the gold standard annotator was producing 
this feedback in real time, while the test set it- 
self is, of course, not used for this feedback. This 
is an idealized situation, since it assumes that a 
true active learning situation would have access 
to someone who could annotate with perfect con- 
sistency to the gold standard corpus annotation 
conventions. 

Because our goal is to investigate the relative
costs of rule writing versus annotation, it is
essential that we use a realistic model of annotation.
Therefore, we decided to do a fully-fledged
active learning annotation experiment, with real
time human supervision, rather than assume the
simulated feedback of actual Treebank annotators.

We developed an annotation tool that is mod- 
eled on MITRE's Alembic Workbench software 
(Day et al., 1997), but written in Java for 
platform-independence. To enable data storage 
and the active learning sample selection to take 
place on the more powerful machines in our lab 
rather than the user's home machine, the tool was 
designed with network support so that it could 
communicate with our servers over the internet. 

Our real-time active learning experiment sub- 
jects were seven graduate students in computer 
science. Five of them are native English speak- 
ers, but none had any formal linguistics training. 
The initial training set T is the first 100 sentences
of Ramshaw & Marcus' training set. To acquaint
the subjects with the Treebank conventions, they
were first asked to spend some time in a feedback
phase, where they would annotate up to 50 sen- 
tences (they were allowed to stop at any time) 
drawn from the initial 100 sentences in T. The 
sentences were annotated one at a time, and the 
Treebank annotation was shown to them after ev- 
ery sentence. On average, the annotators spent 
around 15 minutes on this feedback phase before 
deciding that they were comfortable enough with 
the convention. 

The active learning phase follows the feedback
phase. The f-complement disagreement measure
was used to select 50 sentences from the rest of
Ramshaw & Marcus' training set and the annotator
was instructed to annotate them. The annotated
sentences were then sent back to the server,
and the system chose the next 50 sentences. The
experiment consisted of 10 iterations, during which
the annotators were allowed to make use of the
original 100 sentences as a reference corpus. After
completing all 10 iterations, they were asked
to annotate a further 100 consecutive sentences
drawn randomly from the test set. The purpose
of this final annotation was to judge how well
annotators tag sentences drawn with the true
distribution from the test corpus, as we shall see in
section 5.2.

On average, the annotators took 17 minutes to 
annotate each set of 50 sentences, ranging from 8 
to 30 minutes. The average amount of time the 
server took to run the active learning algorithm 
and select the next batch of sentences was approx- 
imately 3 minutes, a rest break for the annotators. 

The analysis of the results is presented in
section 5.

4 Learning by Rules 



In previous work, Brill & Ngai (1999) showed that 
under certain circumstances, it is possible for hu- 
mans writing rules to perform as well as a state- 
of-the-art machine learning system for base noun 
phrase chunking. What that study did not ad- 
dress, however, was the cost of the human labor 
and/or machine cycles involved to construct such 
a system, nor the relative cost of obtaining the 
training data for the machine learning system. 
This paper will estimate and contrast these costs 
relative to performance. 

To investigate the costs of a human rule-writing
system, we used a similar framework to that of
Brill & Ngai. The system was written as a CGI
script which could be accessed across the web from
a browser such as Netscape or Internet Explorer.
Like Brill & Ngai's 1999 approach, our rules were
based on Perl regular expressions. However, instead
of explicitly defining rule actions and having
different kinds of rules, our rules implicitly define
their actions by using different symbols to denote
the placement of the base noun phrase-enclosing
parentheses prior to and after the application of
the rule. Table 1 presents a comparison of our
rule format against that of Brill & Ngai. The
rules presented here may be considered less
cumbersome and more intuitive.

In a way that is similar to Brill & Ngai's system,
our rules were translated into Perl regular
expressions and evaluated on the corpus. New
rules are appended onto the end of the list and
each rule applied in order, in the paradigm of a
transformation-based rule list.

Example Task: Inserting New Brackets

  Brill & Ngai 1999 Rule Format:
    TY: A
    LC: null
    TAR:({1} t=DT) (* t=JJ[RS]?) \
        (+ t=NNP?S?)
    RC: null

  New Rule Format:
    { _DT ADJ* NOUN+ }

  Effect of Rule Application:
    The_DT man_NN ran_VBD
    ( The_DT man_NN ) ran_VBD

Example Task: Splitting A Noun Phrase

  Brill & Ngai 1999 Rule Format:
    TY: S
    LC: null
    TAR1:(* t=\w+) (+ t=NNP?S?)
    MC: null
    TAR2:(* t=JJ[RS]?) ({1} w=\w+day)
    RC: null

  New Rule Format:
    [ { ANYWORD* NOUN+ } { ADJ* TIMEDAY } ]

  Effect of Rule Application:
    ( New_NNP York_NNP Friday_NNP )
    ( New_NNP York_NNP ) ( Friday_NNP )

Example Task: Moving A Bracket

  Brill & Ngai 1999 Rule Format:
    TY: T
    LC: <<< ({1} w=about)
    TAR:({1} t=$) (+ t=CD)
    RC: null

  New Rule Format:
    { about_ [ _$ NUM+ ] }

  Effect of Rule Application:
    about_IN ( $_$ 5_CD )
    ( about_IN $_$ 5_CD )

Table 1: Comparison of our current rule format with Brill & Ngai (1999)

4.1 Rule-Writing Experiments

The rule-writing experiments were conducted by a
group of 17 advanced computer science students,
using the same test set as in the annotation
experiments and the same initial 100 gold standard
sentences, both for initial guidance on bracketing
standards and for rule-quality feedback throughout
their work.

# Grab-all rule
{ _RB::? ADJ* ANOUN* ADJ* ANOUN+ }

# (blah blah last Fri)->(blah blah) (last Fri)
{ [ ANYTHING* } { _JJ TIME_W ] }
{ [ NOT_ADJ+ } { TIME_W ] }

# about $8 (an ounce) -> (about $8 an ounce)
{ (Only|only|About|about)_::? _(\$|#)::? \
  _CD::+ [ ANYTHING+ ] }
{ _RBR::* _(PDT|JJ)::? _(DT|PRP\$|POS) ADJ* \
  _RB::? VERB? [ ANYTHING+ ] }

# ( boy ) -> ( that boy )
[ (That|that)__DT { ANYTHING+ } ]

# ( about 4 1/2 )
{ (only|about)_::? (\$|#)_::? _CD::+ }
{ [ ANYTHING+ [? ANYTHING* ]? ] _-LRB- \
  [ ANYTHING+ ] _-RRB- [ ANYTHING+ ] }

# Pronouns are usually baseNPs
{ _DT::? _PRP }

# ''and'' usually isn't in a baseNP
{ [ _\S+::+ ] (and|&)_ [ _\S+::+ ] }

# more singleton baseNPs
{ _(DT|EX|WP|WDT) } VERB

# some numbers are singleton baseNPs
{ [ ANYTHING ] [ _CD ] }

# ( much/most ) of
{ _(DT|RB)::? (much|most)_ } _IN

Figure 2: An Example Rule List. Lines beginning
with hash marks (#) are comments.



The time that the students spent on the task 
varied widely, from a minimum of 1.5 hours to a 
maximum of 9 hours, with an average of 5 hours. 
Because we captured and saved every change the 
students made to their rule list and logged ev- 
ery mouse click they made while doing the exper- 
iment, it was possible for us to trace the perfor- 
mance of the system as a function of time. Figure
2 shows the rule list constructed by one of the sub-
jects. The quantitative results of the rule-writing 
experiments are presented in the next section. 

5 Experiment Results — Rule 
Writing vs. Annotation 

This section will analyze and compare the per- 
formance of systems constructed with hand-built 
rules with systems that were trained from data 
selected during real-time active learning. 

The performance of Ramshaw & Marcus' system
trained on the annotations of each subject
in the real-time active learning experiments, and
the performance achieved by the manually
constructed systems of the top 6 rule writers, are
shown in Figures 3 and 4, the latter depicting the
performance achieved by each individual system. The
x-axes show the time spent by each human subject
(either annotating or writing rules) in minutes;
the y-axes show the f-measure performance
achieved by the systems built using the given level
of supervision.

5.1 Analysis of Comparative 
Experimental Data 

It is important to note that when comparing the
curves in Figure 3, experimental conditions across
groups were kept as equal as possible, with any
known potential biases favoring the rule-writing
group. First, both groups began with the identical
100 sentence gold standard set, for initial
inspection and performance feedback throughout the
rule-writing process. The higher starting point
for the annotation-driven learning curves was due
to the fact that the machine learning algorithm
could do initial training immediately on this data.
The rule-writing learners also received immediate
feedback on their first rules using this data, but
were slower to incorporate this feedback into their
new rules. The six rule-writers used for
comparative purposes were all native speakers, while the
annotation group included 2 non-native speakers.
Also, to further minimize the potential for any
unknown biases in sample selection in favor of
annotation, the rule-writers who were evaluated and
illustrated in these graphs were the 6 strongest
performers out of the pool of 17, while all 7
annotation results are compared. Despite this
favorable treatment, rule-writing still underperforms
annotation-based learning with statistical
significance of P < 0.02 for 100 minutes of investment,
and with significance of P < 0.05 for times up to
at least 2.5 hours. The high variance in the
rule-writer pool complicates a finding of significance
beyond this point, but at all quantities of human
labor invested, mean annotation-based F-measure
outperformed rule-writing and these trends
appear to extrapolate.

Figure 3: Consensus Performance for Systems constructed via Rule Writing, non-expert Annotation
and Treebank Annotation (Individual curves are in Figure 4)

Figure 4: Annotation versus Rule Writing:
Performance detailed by individual participant.

5.2 Analysis of Human Performance on 
the Annotation Task 

It appears that a major limiting factor to higher 
annotation-based learning is the accuracy of the 
annotators themselves relative to the evaluation 
gold standard (the Treebank in this case). To 
study this factor, at the end of their active- 
learning experiments annotators were asked to 
annotate a further 100 sentences from the same 
test data used to evaluate the learning algorithms. 
Their F-measure performance on this data, as if
they were a competing annotation system, is given
in Table 2. These measures of agreement with
the gold standard effectively constitute an upper
bound on the performance of any system trained
on their data.





               F-Measure Performance on
               100 held-out sentences
Annotator 1    92.92
Annotator 2    92.54
Annotator 3    91.27
Annotator 4    90.20
Annotator 5    88.17
Annotator 6    86.14
Annotator 7    83.86

Table 2: Annotation Performance on 100 Test Set
Sentences



Thus, to further put annotation-trained system
performance in perspective, Figure 5 shows the
performance of individual trained systems
relative to the highest achieved performance of the
annotator on which that system was trained. In
each case, the ratio is close to 1, indicating that
the machine learning model achieves performance
close to that of the annotator whose data it was
trained on.

Figure 5: Performance of Ramshaw & Marcus'
system trained on annotations by Annotator i, as
a percentage of that person's own performance
measured against the Treebank (an effective upper
bound).



6 Cost Models for Cross-Modal 
Learning Comparison 

Traditionally, evaluation models in the machine 
learning literature measure performance relative 
to variable quantities of training data. However, 
this measure is inappropriate for contrasting data- 
trained with rule-writing approaches where time 
and labor cost are the primary variables. Here we 
present cost models allowing these different ap- 
proaches to be compared directly and objectively. 

Below are two ways to evaluate the relative cost
of a learning algorithm. The first measure
contrasts performance relative to a common standard
of invested human labor. The second measure
considers a fuller range of potential development
costs in a monetary-based common denominator.

6.1 Time-Based Cost 

For the purposes of the above experiments, we 
have chosen total human effort (in time) as 
the variable resource, which successfully maps 
machine-learning and rule-based learning on a 
common measure of training resource investment. 

Time-based evaluation is also useful for com- 
paring systems trained on different annotators. 
As shown in Figure 4, annotators varied in the
time they used to tag the full 500-sentence data 
set. Because performance tended to be lower when 
trained on the faster annotators (perhaps due to 
less careful work), a performance-per-sentences- 
tagged measure would tend to rank these indi- 
viduals lower than their slower (but more care- 
ful) colleagues. However, when measured by the 
learning accuracy achieved per amount of annota- 
tor time invested, the faster but noisier annotators 
performed much more competitively (although the 
relative benefits of higher volumes of noisier data 
may not extrapolate well). Annotator-time-based
performance measures provide a useful standard 
for evaluating this volume-noise tradeoff. 

6.2 Monetary Cost 

Annotator labor doesn't capture the complete rel- 
ative cost of a given approach, however. It is use- 
ful, therefore, to measure resource investment in 
terms of the common denominator of monetary 
cost over a fuller set of potential cost variables. 
Table 3 details a set of other monetary
parameters considered in the current studies. Given
these parameters, one possible approximation of
this cost function given variable time investment
T and learning method M is:

$$MonetaryCost(M, T) = IDC_M + (S_0 \times AC_{TB}) + (T \times (LC_M + MC_A))$$

Cost Model Parameter                                                   Annotation     Rule-writing
IDC_M = Infrastructure Development Cost (for tagging/RW environment)   Shared         Shared
S_0 = Number of initial gold standard sentences for training           100            100
AC_TB = Gold standard (Treebank) Annotation Cost (per sentence)        X              X
LC_M = Labor Cost for Annotation or RW (per hour)                      $12.00/hour    $12.00/hour
MC_A = Cost of Machine Cycles for Annotation/RW Support                $0.24/hour     $0.12/hour
T = Variable time investment

Table 3: Example Monetary-Based Cost Parameters for Model Comparison

Although we assume equal labor cost rates
LC_M for annotation and rule writing, these may
substantially differ in some environments, and
certainly will be higher for professional-quality
annotation or rule-based development. And while
the estimates of the machine cycle cost
necessary to support this work on Linux-based PCs
vary somewhat, they are dwarfed by the
labor costs. We have assumed that the
infrastructure development costs for the tagging and
rule-writing environments, while initially variable
across methods, have already been borne, and to
the extent that both interface systems port to new
languages and domains with relative ease, the
incremental development costs for new trials are
likely to be relatively low and comparable.
Finally, this cost model takes into account the cost
of developing or acquiring the S_0 gold standard
tagged data (e.g. from the Treebank) to provide
initial and/or incremental training feedback to the
annotator or rule writer to help force consistency
with the gold standard. We have found that both
learning modes can benefit from this high quality
feedback. However, the cost X of developing such
a high-quality resource for new languages or
domains is unknown, and likely will be higher than
the non-expert labor costs employed here.
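
As a worked illustration of the monetary cost function, the following Python sketch plugs in the hourly rates from Table 3. It assumes the gold-standard term is the number of sentences multiplied by the per-sentence cost. Because the infrastructure cost is shared and the per-sentence cost X is unknown, both are set to zero here purely as placeholders. The function and argument names are hypothetical.

```python
def monetary_cost(hours, labor_rate, machine_rate,
                  infrastructure=0.0, gold_sentences=100,
                  gold_cost_per_sentence=0.0):
    """Approximate monetary cost of T hours of work under one method.

    infrastructure and gold_cost_per_sentence default to 0.0 only as
    placeholders: the source gives them as "Shared" and "X" (unknown).
    """
    gold_cost = gold_sentences * gold_cost_per_sentence
    return infrastructure + gold_cost + hours * (labor_rate + machine_rate)


# Variable cost of 2.5 hours of effort under each method, in dollars.
annotation = monetary_cost(2.5, labor_rate=12.00, machine_rate=0.24)
rule_writing = monetary_cost(2.5, labor_rate=12.00, machine_rate=0.12)
print(f"annotation:   ${annotation:.2f}")
print(f"rule writing: ${rule_writing:.2f}")
```

Under these rates the two methods differ by only the machine-cycle term, about 30 cents over 2.5 hours, which illustrates how labor dominates the variable cost.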

7 Rules vs. Annotation-based 
Learning — Advantages and 
Disadvantages 

In the previous sections, we investigated the per- 
formance differences and resource costs involved 
for using humans to write rules vs. using them 
for annotations. In this section, we will further 
compare these system development paradigms. 

Annotation-based human participation has a 
number of significant practical advantages relative 
to developing a system by manual rule-writing: 

• Annotation-based learning can continue in- 
definitely, over weeks and months, with rela- 
tively self-contained annotation decisions at 
each point. In contrast, rule-writers must 
remain cognizant of potential previous rule 
interdependencies when adding or revising 
rules, ultimately bounding continued rule- 
system growth by cognitive load factors. 

• Annotation-based learning can more effec- 
tively combine the efforts of multiple indi- 
viduals. The tagged sentences from different 
data sets can be simply concatenated to form 
a larger data set with broader coverage. In 
contrast, it is much more difficult, if not im- 
possible, for a rule writer to resume where 
another one left off. Furthermore, combining
rule lists is very difficult because of the tight
and complex interaction between successive 
rules. Combination of rule writing systems is 
therefore limited to voting or similar classifier 
techniques which can be applied to annota- 
tion systems as well. 

• Rule-based learning requires a larger skill 
set, including not only the linguistic knowl- 
edge needed for annotation, but also compe- 
tence in regular expressions and an ability to 
grasp the complex interactions within a rule 
list. These added skill requirements naturally
shrink the pool of viable participants and
increase their likely cost.

• Based on empirical observation, the
performance of rule writers tends to exhibit
considerably more variance, while systems trained
on annotation tend to yield much more
consistent results.

• Finally, the current performance of 
annotation-based training is only a lower 
bound based on the performance of current 
learning algorithms. Since annotated data 
can be used by other current or future 
machine learning techniques, subsequent 
algorithmic improvements may yield perfor- 
mance improvements without any change 
in the data. In contrast, the performance 
achieved by a set of rules is effectively final 
without additional human revision. 

The potential disadvantages of annotation- 
based system development for applications such 
as base NP chunking are limited. Given the 
cost models presented in Section 6, one potential
negative scenario would be an environment
where the machine cost significantly outweighed 
human labor costs, or where access to active learn- 
ing and annotation infrastructure was unavailable 
or costly. However, under normal circumstances 
where machine analysis of text is pursued, and 
public domain access to our annotation and ac- 
tive learning toolkits is assumed, such a scenario 
is unlikely. 



8 Conclusion 

This paper has illustrated that there are poten- 
tially compelling practical and performance ad- 
vantages to pursuing active-learning based anno- 
tation rather than rule-writing to develop base 
noun phrase chunkers. The relative balance depends
ultimately on one's cost model, but given
the goal of minimizing total human labor cost, it 
appears to be consistently more efficient and ef- 
fective to invest these human resources in system- 
development via annotation rather than rule writ- 
ing. 